Multilingual Summarization: Dimensionality Reduction and a Step Towards Optimal Term Coverage

Authors

  • John Conroy
  • Sashka T. Davis
  • Jeff Kubina
  • Yi-Kai Liu
  • Dianne P. O'Leary
  • Judith D. Schlesinger
Abstract

In this paper we present three term weighting approaches for multilingual document summarization and give results on the DUC 2002 data as well as on the 2013 Multilingual Wikipedia featured articles data set. We introduce a new interval-bounded nonnegative matrix factorization. We use this new method, latent semantic analysis (LSA), and latent Dirichlet allocation (LDA) to give three term-weighting methods for multi-document multilingual summarization. Results on DUC and TAC data, as well as on the MultiLing 2013 data, demonstrate that these methods are very promising, since they achieve oracle coverage scores in the range of humans for 6 of the 10 test languages. Finally, we present three term weighting approaches for the MultiLing13 single-document summarization task on the Wikipedia featured articles. Our submissions significantly outperformed the baseline in 19 out of 41 languages.

1 Our Approach to Single and Multi-Document Summarization

The past 20 years of research have yielded a bounty of successful methods for single document summarization (SDS) and multi-document summarization (MDS). Techniques from statistics, machine learning, numerical optimization, graph theory, and combinatorics are generally language-independent and have been applied to both single and multi-document extractive summarization of multilingual data. In this paper we extend the work of our research group, most recently discussed in Davis et al. (2012) for multi-document summarization, and apply it to both single and multi-document multilingual summarization.

Our extractive multi-document summarization performs the following steps:

1. Sentence boundary detection;
2. Tokenization and term identification;
3. Term-sentence matrix generation;
4. Term weight determination;
5. Sentence selection;
6. Sentence ordering.

Sentence boundary detection and tokenization are language-dependent, while steps (3)-(6) are language-independent. We briefly discuss each of these steps.

We use a rule-based sentence splitter, FASSTE (very Fast, very Accurate Sentence Splitter for Text – English) (Conroy et al., 2009), and its multilingual extensions (Conroy et al., 2011) to determine the boundaries of individual sentences. Proper tokenization improves the quality of the summary and may include stemming as well as morphological analysis to disambiguate compound words in languages such as Arabic. Tokenization may also include stop word removal. The result of this step is that each sentence is represented as a sequence of terms, where a term can be a single word, a sequence of words, or a character n-gram. The specifics of tokenization are discussed in Section 2.

Matrix generation (the vector space model) was pioneered by Salton (1991). Later, Dumais (1994) introduced dimensionality reduction in document retrieval systems, and this approach has also been …
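To make steps (3)-(5) above concrete, here is a minimal sketch, assuming scikit-learn's standard NMF as a stand-in for the interval-bounded NNMF described in the paper (substituting TruncatedSVD or LatentDirichletAllocation would give LSA- or LDA-style weights) and a generic greedy term-coverage heuristic in place of the authors' sentence-selection algorithm; the function name `summarize` and its parameter defaults are purely illustrative.

```python
# Sketch of pipeline steps (3)-(5): term-sentence matrix, factorization-based
# term weights, and greedy sentence selection for weighted term coverage.
# NOTE: standard scikit-learn NMF is used here as a stand-in for the paper's
# interval-bounded NNMF; the greedy loop is a generic coverage heuristic,
# not the authors' selection algorithm.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF


def summarize(sentences, n_topics=5, budget_words=100):
    # Step 3: term-sentence matrix (rows = sentences, columns = terms).
    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(sentences)

    # Step 4: term weights from a low-rank nonnegative factorization X ~ W * H.
    k = min(n_topics, min(X.shape))              # keep the rank feasible for small inputs
    model = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
    model.fit(X)
    H = model.components_                        # topic-by-term loadings
    term_weights = H.sum(axis=0)                 # stand-in weight: total loading per term

    # Step 5: greedily add the sentence whose still-uncovered terms carry the
    # most weight, until the word budget is exhausted.
    incidence = X.toarray() > 0                  # binary term incidence per sentence
    covered = np.zeros(incidence.shape[1], dtype=bool)
    chosen, words = [], 0
    while words < budget_words:
        gains = [(term_weights[row & ~covered].sum(), i)
                 for i, row in enumerate(incidence) if i not in chosen]
        if not gains:
            break
        gain, best = max(gains)
        if gain <= 0:                            # all weighted terms already covered
            break
        chosen.append(best)
        covered |= incidence[best]
        words += len(sentences[best].split())

    # Step 6 (sentence ordering) is approximated by original document order.
    return [sentences[i] for i in sorted(chosen)]
```

Given a list of pre-split sentences, `summarize(sentences, budget_words=100)` returns an extractive summary in original document order; the usefulness of the weights depends heavily on the sentence splitting, tokenization, and stop-word handling described above.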


Similar Articles

Vector Space Models for Scientific Document Summarization

In this paper we compare the performance of three approaches for estimating the latent weights of terms for scientific document summarization, given the document and a set of citing documents. The first approach is a term-frequency (TF) vector space method utilizing a nonnegative matrix factorization (NNMF) for dimensionality reduction. The other two are language modeling approaches for predicti...


Dimensionality Reduction Aids Term Co-Occurrence Based Multi-Document Summarization

A key task in an extraction system for query-oriented multi-document summarisation, necessary for computing relevance and redundancy, is modelling text semantics. In the Embra system, we use a representation derived from the singular value decomposition of a term co-occurrence matrix. We present methods to show the reliability of performance improvements. We find that Embra performs better with...


Multilingual Summarization Experiments on English, Arabic and French (Résumé Automatique Multilingue Expérimentations sur l'Anglais, l'Arabe et le Français) [in French]

The task of multilingual summarization aims to design language-independent systems. Extractive methods are at the core of multilingual summarization systems. In this paper, we discuss the influence of various basic NLP tasks: sentence splitting, tokenization, stop word removal and stemming on sentence scoring and summaries' coverage. Hence, we propose a statistical method which extracts most rel...


Dimensionality Reduction with Multilingual Resource

Query and document representation is a key problem for information retrieval and filtering. The vector space model (VSM) has been widely used in this domain. But the VSM suffers from high dimensionality. The vectors built from documents always have high dimensionality and contain too much noise. In this paper, we present a novel method that reduces the dimensionality using multilingual resource...


Task Knowledge in Abstractive Summarization

This paper discusses the path towards abstractive summarization and proposes a new knowledge-based methodology called KBABS as a step forward on this path. We propose to use both world knowledge, to identify useful content, and task knowledge, to filter out unreliable content, to generate more accurate summaries. This approach was implemented for guided summarization. The evaluation shows that,...




Publication date: 2013